Abstract. Since its inception in 1993, the ADS Abstract
Service has become an indispensable research tool for as- 
tronomers and astrophysicists worldwide. In those seven 
years, much effort has been directed toward improving 
both the quantity and the quality of references in the 
database. From the original database of approximately 
160,000 astronomy abstracts, our dataset has grown al- 
most tenfold to approximately 1.5 million references cov- 
ering astronomy, astrophysics, planetary sciences, physics, 
optics, and engineering. We collect and standardize data 
from approximately 200 journals and present the resulting 
information in a uniform, coherent manner. With the co- 
operation of journal publishers worldwide, we have been 
able to place scans of full journal articles on-line back to 
the first volumes of many astronomical journals, and we 
are able to link to current versions of articles, abstracts, and
datasets for essentially all of the current astronomy liter- 
ature. The trend toward electronic publishing in the field, 
the use of electronic submission of abstracts for journal 
articles and conference proceedings, and the increasingly 
prominent use of the World Wide Web to disseminate in- 
formation have enabled the ADS to build a database un- 
paralleled in other disciplines. 

The ADS can be accessed at 
http://adswww.harvard.edu

Key words: methods: data analysis - astronomical bib- 
liography - astronomical sociology 



1. Introduction 

Astronomers today are more prolific than ever before. 
Studies in publication trends in astronomy (Abt 1994, 
Abt 1995, Schulman et al. 1997) have hypothesized that 
the current explosion in published papers in astronomy is 
due to a combination of factors: growth in professional
society membership, an increase in papers by multiple
authors, the launching of new spacecraft, and increased
competition for jobs and PIs in the field (since candidate
evaluation is partially based on publication history). As 
the number of papers in the field grows, so does the need 
for tools which astronomers can use to locate that fraction 
of papers which pertain to their specific interests. 

The ADS Abstract Service is one of several biblio- 
graphic services which provide this function for astron- 
omy, but due to the broad scope of our coverage and the 
simplicity of access to our data, astronomers now rely ex- 
tensively on the ADS, and other bibliographic services not 
only link to us, but some have built their bibliographic 
search capabilities on top of the ADS system. The Inter- 
national Society for Optical Engineering (SPIE) and the 
NASA Technical Report Service (NTRS) are two such ser- 
vices. 

The evolution of the Astrophysics Data System (ADS) 
has been largely data-driven. Our search tools and index- 
ing routines have been modified to maximize speed and 
efficiency based on the content of our dataset. As new 
types of data (such as electronic versions of articles) be- 
came available, the Abstract Service quickly incorporated 
that new feature. The organization and standardization 
of the database content is the very core upon which the 
Abstract Service has been built. 

This paper contains a description of the ADS Abstract 
Service from a "data" point of view, specifically descrip- 
tions of our holdings and of the processes by which we 
ingest new data into the system. Details are provided on 
the organization of the databases (section 2), the descrip- 
tion of the data in the databases (section 3), the creation 
of bibliographic records (section 4) , the procedures for up- 
dating the database (section 5), and on the scanned arti- 
cles in the Astronomy database (section 6). We discuss the 
interaction between the ADS and the journal publishers 
(section 7) and analyze some of the numbers correspond- 
ing to the datasets (section 8). In conjunction with three 
other ADS papers in this volume, this paper is intended to 
offer details on the entire Abstract Service with the hopes
that astronomers will have a better understanding of the 
reference data upon which they rely for their research. 
In addition, we hope that researchers in other disciplines 
may be able to benefit from some of the details described 
herein. 

As is often the case with descriptions of active Internet 
resources, what follows is a description of the present sit- 
uation with the ADS Abstract Service. New features are 
always being added, some of which necessitate changes in 
our current procedures. Furthermore, with the growth of 
electronic publishing, some of our core ideas about bibli- 
ographic tools and requirements must be reconsidered in 
order to be able to take full advantage of new publishing 
technologies for a new millennium. 

2. The Databases 

The ADS Abstract Service was originally conceived of 
in the mid 1980's as a way to provide on-line access 
to bibliographies of astronomers which were previously 
available only through expensive librarian search services 
or through the A&A Abstracts series (Schmadel 1979,
Schmadel 1982, Schmadel 1989), published by the As- 
tronomisches Rechen-Institut in Heidelberg. While the 
ideas behind the Abstract Service search engine were being 
developed (see Kurtz et al. 2000, hereafter OVERVIEW),
concurrent efforts were underway to acquire a reliable data 
source on which to build the server. In order to best de- 
velop the logistics of the search engine it was necessary 
to have access to real literature data from the past and 
present, and to set up a mechanism for acquiring data in 
the future. 

An electronic publishing meeting in the spring of 1991 
brought together a number of organizations whose ulti- 
mate cooperation would be necessary to make the system 
a reality (see OVERVIEW for details). NASA's Scientific 
and Technical Information Program (STI) offered to pro- 
vide abstracts to the ADS. STI's abstracts were a rewrit- 
ten version of the original abstracts, categorized and key- 
worded by professional editors. They not only abstracted 
the astronomical literature, but many other scientific dis- 
ciplines as well. With STI agreeable to providing the past 
and present literature, and the journals committed to pro- 
viding the future literature, the data behind the system 
fell into place. The termination of the journal abstracting 
by the STI project several years later was unfortunate, 
but did not cause the collapse of the ADS Abstract Ser- 
vice because of the commitment of the journal publishers 
to distribute their information freely. 

The STI abstracting approximately covered the period 
from 1975 to 1995. With the STI data alone, we estimated 
the completeness of the Astronomy database to be better 
than 90% for the core astronomical journals. Fortunately, 
with the additional data supplied by the journals, by SIMBAD
(Set of Identifications, Measurements, and Bibliographies
for Astronomical Data, Egret & Wenger 1988)
at the CDS (Centre de Donnees Astronomiques de Stras- 
bourg), and by performing Optical Character Recognition 
(OCR) on the scanned table of contents (see section 6 
below), we are now closer to 99% complete for that period.
In the period since then we are 100% complete for
those journals which provide us with data, and signifi- 
cantly less complete for those which do not (e.g. many ob- 
servatory publications and non-U. S. journals). The data 
prior to 1975 are also significantly incomplete, although 
we are currently working to improve the completeness of 
the early data, primarily through scanning the table of 
contents for journal volumes as they are placed on-line. 
We are 100% complete for any journal volume which we 
have scanned and put on-line, since we verify that we have
all bibliographic entries during the procedure of putting 
scans on-line. 

Since the STI data were divided into categories, it was 
easy to create additional databases with non-astronomical 
data which were still of interest to astronomers. The cre- 
ation of an Instrumentation database has enabled us to 
provide a database for literature related to astronomical 
instrumentation, of particular interest to those scientists 
building astronomical telescopes and satellite instruments. 
We were fortunate to get the cooperation of the SPIE 
very quickly after releasing the Instrumentation database. 
SPIE has become our major source of abstracts for the In- 
strumentation database now that STI no longer supplies 
us with data. 

Our Physics and Geophysics database, the third 
database to go on-line, is intended for scientists working in 
physics-related fields. We add authors and titles from all of 
the physics journals of the American Institute of Physics 
(AIP), the Institute of Physics (IOP), and the American 
Physical Society (APS), as well as many physics journals 
from publishers such as Elsevier and Academic Press (AP).

The fourth database in the system, the Preprint 
database, contains a subset of the Los Alamos 
National Laboratory's (LANL) Preprint Archive 
(Los Alamos National Laboratory 1991). Our database 
includes the LANL astro-ph preprints which are re- 
trieved from LANL and indexed nightly through an 
automated procedure. That dataset includes preprints 
from astronomical journals submitted directly by authors. 

3. Description of the Data 

The original set of data from STI contained several basic 
fields of data (author, title, keywords, and abstracts) to 
be indexed and made available for searching. All records 
were keyed on STI's accession number, a nine-character code
consisting of a letter prefix (A or N) followed by a two-digit
publication year, a hyphen, and a five-digit identifier (e.g.
A95-12345). Data were stored in files named by accession
number. 

With the inclusion of data from other sources, pri- 
marily the journal publishers and SIMBAD, we extended 
STI's concept of the accession number to handle other 
abstracts as well. Since the ADS may receive the same 
abstract from multiple sources, we originally adopted a 
system of using a different prefix letter with the remain- 
der of the accession number being the same to describe 
abstracts received from different sources. Thus, the same 
abstract for the above accession number from STI would 
be listed as J95-12345 from the journal publisher and S95- 
12345 from SIMBAD. This allowed the indexing routines 
to consider only one instance of the record when indexing. 
Recently, limitations in the format of accession numbers 
and the desire to index data from multiple sources (rather 
than just STI's version) have prompted us to move to a 
data storage system based entirely on the bibliographic 
code. 
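
The prefix-letter scheme described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the ADS indexing code: the function name group_by_accession is hypothetical, and the second accession number in the usage note is made up.

```python
from collections import defaultdict


def group_by_accession(accessions):
    """Group accession numbers that differ only in their source prefix letter.

    Hypothetical helper: 'A95-12345' (STI), 'J95-12345' (journal publisher)
    and 'S95-12345' (SIMBAD) share the remainder '95-12345', so an indexer
    can treat them as one record and consider a single instance of it.
    """
    groups = defaultdict(list)
    for accession in accessions:
        prefix, remainder = accession[0], accession[1:]
        groups[remainder].append(prefix)
    return dict(groups)
```

Calling group_by_accession(["A95-12345", "J95-12345", "S95-12345", "A95-54321"]) collects the first three entries under the key "95-12345" and keeps the made-up fourth entry on its own.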

3.1. Bibliographic Codes 

The concept of a unique bibliographic code used 
to identify an article was originally conceived of by 
SIMBAD and NED (NASA's Extragalactic Database, 
Helou & Madore 1988). The original specification is de- 
tailed in Schmitz et al. 1995. In the years since, the ADS 
has adopted and expanded their definition to be able to 
describe references outside of the scope of those projects. 

The bibliographic code is a 19-character string composed
of several fields which usually enables a user to
identify the full reference from that string. It is defined as 
follows: 

YYYYJJJJJVVVVMPPPPA 

where the fields are defined in Table 1. 

The journal field is left-justified and the volume and 
page fields are right-justified. Blank spaces and leading
zeroes are replaced by periods. For articles with page numbers
greater than 9999, the M field contains the first digit
of the page number. The A field contains a colon (":") if 
there is no author listed. 
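
The field layout and padding rules above can be sketched in Python as follows. This is a minimal sketch for illustration, not the routine used by the ADS; the function name make_bibcode and its argument order are assumptions.

```python
def make_bibcode(year, journal, volume, page, author_initial, qualifier="."):
    """Assemble a 19-character code laid out as YYYYJJJJJVVVVMPPPPA.

    Hypothetical helper following the rules in the text: the journal field
    is left-justified, volume and page are right-justified, unused places
    become periods, and a colon stands in for a missing author.
    """
    page = str(page)
    if len(page) > 4:
        # Page numbers above 9999: the first digit moves into the M field,
        # replacing whatever qualifier was passed in.
        qualifier, page = page[0], page[1:]
    code = (
        f"{int(year):04d}"
        + journal.ljust(5, ".")
        + str(volume).rjust(4, ".")
        + qualifier
        + page.rjust(4, ".")
        + (author_initial or ":")
    )
    if len(code) != 19:
        raise ValueError(f"fields do not fit the 19-character layout: {code!r}")
    return code
```

With these rules, make_bibcode(1999, "PASP", 111, 438, "F") reproduces the code 1999PASP..111..438F quoted in the text.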

Creating bibliographic codes for the astronomical jour- 
nals is uncontroversial. Each journal typically has a 
commonly-used abbreviation, and the volume and page 
are easily assigned (e.g. 1999PASP..111..438F). Each vol- 
ume tends to have individual page numbering, and in 
those cases where more than one article appears on a page 
(such as errata), a "Q", "R", "S", etc. is used as the qualifier
for publication to make bibliographic codes unique.
When page numbering is not continuous across issue numbers
(such as Sky & Telescope), the issue number is represented
by a lower case letter as the qualifier for publication
(e.g. "a" for issue 1). This is because there may be multiple 
articles in a volume starting on the same page number. 

Creating bibliographic codes for the "grey" literature 
such as conference proceedings and technical reports is a
more difficult task. The expansion into these additional
types of data included in the ADS required us to mod- 
ify the original prototype bibliographic code definition in 
order to present identifiers which are easily recognizable 
to the user. The prototype definition of the bibliographic 
code suggested using a single letter in the second place of 
the volume field to identify non-standard references (cat- 
alogs, PhD theses, reports, preprints, etc.) and using the 
third and fourth place of that field to unduplicate and re- 
port volume numbers (e.g. 1981CRJS..R.3...14W). Since 
we felt that this created codes unidentifiable to the typ- 
ical user and since NED and SIMBAD did not feel that 
users needed to be able to identify books directly from 
their bibliographic codes, the ADS adopted different rules 
for creating codes to identify the grey literature. 

It is straightforward to create bibliographic codes for 
conference proceedings which are part of a series. For ex- 
ample, the IAU Symposia Series (IAUS) contains volume 
numbers and therefore fits the journal model for biblio- 
graphic codes. Other conference proceedings, books, col- 
loquia, and reports in the ADS typically contain a four 
letter word in the volume field such as "conf", "proc",
"book", "coll", or "rept". When this is the case with a
bibliographic code, the journal field typically consists of 
the first letter from important words in the title. This can 
give the user the ability to identify a conference proceed- 
ing at a glance (e.g. "ioda.book" for "Information and 
On-Line Data in Astronomy"). We will often leave the 
fifth place of the journal field as a dot for "readability" 
(e.g. 1995ioda.book..175M). For most proceedings which
are also published as part of a series (e.g. ASP Confer- 
ence Series, IAU Colloquia, AIP Conference Series), we 
include in the system two bibliographic codes, one as de- 
scribed above and one which contains the series name and 
the volume (see section 5.1). We do this so that users can 
see, for example, that a paper published in one of the 
"Astronomical Data Analysis Software and Systems" se- 
ries is clearly labelled as "adass" whereas a typical user 
might not remember which volume of ASPC contained 
those ADASS papers. This makes bibliographic codes
more readable for the user.
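
As an illustration of the initials-based journal field, the following Python sketch derives "ioda." from the title quoted above. It rests on assumptions not stated in the text: the list of unimportant words, the limit of four initials, and the names journal_field_from_title and grey_bibcode are all hypothetical.

```python
# Assumed list of "unimportant" words; the text does not say which are skipped.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}
VOLUME_WORDS = ("conf", "proc", "book", "coll", "rept")


def journal_field_from_title(title):
    """Take the first letter of each important word, at most four of them,
    and leave the fifth place of the journal field as a dot."""
    initials = [w[0].lower() for w in title.split() if w.lower() not in STOPWORDS]
    return "".join(initials[:4]).ljust(5, ".")


def grey_bibcode(year, title, volume_word, page, author_initial):
    """Build a code for a book or proceedings entry without a volume number."""
    if volume_word not in VOLUME_WORDS:
        raise ValueError(f"unexpected volume word: {volume_word!r}")
    journal = journal_field_from_title(title)
    return f"{year}{journal}{volume_word}.{str(page).rjust(4, '.')}{author_initial}"
```

For the example in the text, grey_bibcode(1995, "Information and On-Line Data in Astronomy", "book", 175, "M") yields 1995ioda.book..175M.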

With the STI data, the details were often unclear as 
to whether an article was from a conference proceeding, 
a meeting, a colloquium, etc. We assigned those codes as 
best we could, making no significant distinction between 
them. For conference abstracts submitted by the editors 
of a proceedings prior to publication, we often do not have 
page numbers. In this case, we use a counter in lieu of a 
page number and use an "E" (for "Electronic") in the fourteenth
column, the qualifier for publication. If these con-
ference abstracts are then published, their bibliographic 
codes are replaced by a bibliographic code complete with 
page number. If the conference abstracts are published 
only on-line, they retain their electronic bibliographic code 
with its E and counter number. 
Table 1. Bibliographic Code Definition (e.g. 1996A&AS..115....1S)

Field   Definition                                   Example

YYYY    Publication Year                             1997
JJJJJ   Journal Abbreviation                         ApJ, A&A, MNRAS, etc.
VVVV    Volume Number                                480
M       Qualifier for Publication                    L (for Letter), P (for Pink Page)
                                                     Q, R, S, etc. for unduplicating
                                                     a, b, c, etc. for issue number
PPPP    Page Number                                  129
A       First Letter of the First Author's Surname   N



There are several other instances of datasets where the 
bibliographic codes are non-standard. PhD theses in the 
system use "PhDT" as the journal abbreviation, contain 
no volume number, and contain a counter in lieu of a page 
number. Since the codes for PhD theses, like all bibliographic codes, must be
unique across all of the databases, the counter makes each
bibliographic code an identifier for only one thesis. IAU
Circulars also use a counter instead of a page number. 
Current Circulars are electronic in form, and although not 
technically a new page, the second item of an IAU Circu- 
lar is the electronic equivalent of a second page. Using the 
page number as a counter enables us to minimize use of 
the M identifier in the fourteenth place of a bibliographic 
code for unduplicating. This is desirable since codes con- 
taining those identifiers are essentially impossible to create 
a priori, either by the journals or by users. 

The last set of data currently included in the ADS 
which contain non-standard bibliographic codes is the 
"QB" book entries from the Library of Congress. QB is 
the Library of Congress code for astronomy-related books 
and we have put approximately 17,000 of these references 
in the system. Because the QB numbers are identifiers 
by themselves, we have made an exception to the biblio-
graphic code format to use the QB number (complete with 
any series or part numbers), prepended with the publica- 
tion year as the bibliographic code. Such an entry is easily 
identifiable as a book, and these codes enable users to lo- 
cate the books in most libraries. 

It is worth noting that while the bibliographic code 
makes identification simple for the vast majority of refer- 
ences in the system, we are aware of two instances where 
the bibliographic definition breaks down. The use of the 
fourteenth column for a qualifier such as "L" for ApJ Let- 
ters makes it impossible to use that column for unduplicat- 
ing. Therefore, if there are two errata on the same page 
with the same author initial, there is no way to create 
unique bibliographic codes for them. We are aware of only 
one such instance in the 33 years of publication of ApJ Let- 
ters. Second, with the electronic publishing of an increas- 
ing number of journals, the requirement of page numbers 
to locate articles becomes unnecessary. The journal Physical
Review D is currently using 6-digit article identifiers
as page numbers. Since the bibliographic code allows for
page numbers not longer than 5 digits, we are currently
converting these 6-digit identifiers to their 5-digit hexadecimal
equivalent. Both of these anomalies indicate that
over the next few years we will likely need to alter the 
current bibliographic definition in order to allow consis- 
tent identification of articles for all journals. 
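
The arithmetic behind this conversion is simple: the largest 6-digit decimal identifier, 999999, is F423F in base 16, so every such identifier fits in five hexadecimal digits. A minimal Python sketch follows; the function name is hypothetical, and the use of upper-case digits without padding is an assumption, since the text does not specify either.

```python
def article_id_to_hex(article_id):
    """Convert a 6-digit decimal article identifier to hexadecimal.

    Any value up to 999999 needs at most five hexadecimal digits, which is
    what lets it fit where a 5-digit page number is allowed.
    """
    value = int(article_id)
    if not 0 <= value <= 999999:
        raise ValueError("expected an identifier of at most 6 decimal digits")
    return format(value, "X")
```

For instance, the identifier 123456 becomes 1E240, and int("1E240", 16) recovers 123456.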

3.2. Data Fields 

The databases are set up such that some data fields are 
searchable and others are not. The searchable fields (title, 
author, and text) are the bulk of the important data, and 
these fields are indexed so that a query to the database 
returns the maximum number of meaningful results (see
Accomazzi et al. 2000, hereafter ARCHITECTURE). The 
text field is the union of the abstract, title, keywords, and 
comments. Thus, if a user requests a particular word in 
the text field, all papers are returned which contain that 
word in the abstract OR in the title OR in the keywords 
OR in the comments. Appendix A shows version 1.0 of the 
Extensible Markup Language (XML, see section 3.4) Document
Type Definition (DTD) for text files in the ADS Abstract 
Service. The DTD lists fields currently used or expected 
to be used in text files in the ADS (see section 5.2 for de- 
tails on the text files). We intend to reprocess the current 
journal and affiliation fields in order to extract some of 
these fields. 
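
The OR semantics of the text field described above can be sketched in Python. This is an illustration only: the record layout (a dictionary with these four keys), the simple word splitting, and the function names are assumptions, and the real index is described in ARCHITECTURE.

```python
import re

TEXT_PARTS = ("abstract", "title", "keywords", "comments")


def text_field(record):
    """Union of abstract, title, keywords, and comments as one string."""
    return " ".join(record.get(part, "") for part in TEXT_PARTS)


def matches_text(record, word):
    """True if the word occurs in the abstract OR title OR keywords OR comments."""
    words = re.findall(r"[a-z0-9]+", text_field(record).lower())
    return word.lower() in words
```

A record whose title is "Galaxy clusters" and whose keywords are "cosmology" therefore matches a text query for either "galaxy" or "cosmology", even if its abstract mentions neither.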

Since STI ceased abstracting the journal literature, 
we decided to make the keywords themselves no longer a 
searchable entity for the time being - they are searchable 
only through the abstract text field. STI used a different
standard set of keywords from the AAS journals, which
use a different set of keywords from AIP journals (e.g. AJ
prior to 1998). In addition, keywords from a single journal 
such as the Astrophysical Journal (ApJ) have evolved over 
the years so that early ApJ volume keywords are not con- 
sistent with later volumes. In order to build one coherent 
set of keywords, an equivalence or synonym table for these 
different keyword sets must be created. We are investigat- 
ing different schemes for doing this, and currently plan to 
have a searchable keyword field again, which encompasses 
all keywords in the system and equates those from differ-
ent keyword systems which are similar (Lee et al. 1999). 

The current non-searchable fields in the ADS 
databases include the journal field, author affiliation, cat- 
egory, abstract copyright, and abstract origin. Although 
we may decide to create an index and search interface 
for some of these entities (such as category), others will 
continue to remain unsearchable since searching them is 
not useful to the typical user. Author affiliations
would be useful to search; however, this information
is inconsistently formatted so it is virtually impossible to 
collect all variations of a given institution for indexing co- 
herently. Furthermore, we have the author affiliations for 
only about half of the entries in the Astronomy database 
so we have decided to keep this field non-searchable. For 
researchers wishing to analyze affiliations on a large scale, 
we can provide this information on a collaborative basis. 

3.3. Data Sources 

The ADS currently receives abstracts or table of contents 
(ToC) references from almost two hundred journal sources. 
Tables 2, 3, and 4 list these journals, along with their bibli- 
ographic code abbreviation, source, frequency with which 
we receive the data, what data are received, and any links 
we can create to the data. ToC references typically con- 
tain only author and title, although sometimes keywords 
are included as well. The data are contributed via email, 
ftp, or retrieved from web sites around the world at a fre- 
quency ranging from once a week to approximately once 
a year. The term "often" used in the frequency column 
implies that we get them more frequently than once a 
month, but not necessarily on a regular basis. The term
"occasionally" is used for those journals which submit data
to us infrequently. 

Updates to the Astronomy and Instrumentation 
databases occur approximately every two weeks, or more 
often if logistically possible, in order to keep the database 
current. Recent enhancements to the indexing software 
have enabled us to perform instantaneous updates, triggered
by an email containing new data (see ARCHITECTURE).
Updates to the Physics database occur approximately
once every two months. As stated earlier, the
Preprint database is updated nightly. 

3.4. Data Formats

The ADS is able to benefit from certain standards 
which are adhered to in the writing and submission 
practices of astronomical literature. The journals share 
common abbreviations and text formatting routines 
which are used by the astronomers as well. The use 
of TeX (Knuth 1984) and LaTeX (Lamport 1986), and
their extension to BibTeX (Lamport 1986) and AASTeX
(American Astronomical Society 1999) results in common
formats among some of our data sources. This enables the 
reuse of parsing routines to convert these formats to our 
standard format. Other variations of TeX used by journal 
publishers also allow us to use common parsing routines,
which greatly facilitates data loading.

TeX is a public domain typesetting program designed 
especially for math and science. It is a markup system, 
which means that formatting commands are interspersed 
with the text in the TeX input file. In addition to com- 
mands for formatting ordinary text, TeX includes many 
special symbols and commands with which you can for- 
mat mathematical formulae with both ease and precision. 
Because of its extraordinary capabilities, TeX has become 
the leading typesetting system for science, mathematics, 
and engineering. It was developed by Donald Knuth at 
Stanford University. 

LaTeX is a simplified document preparation system 
built on TeX. Because LaTeX is available for just about
any type of computer and because LaTeX files are ASCII, 
scientists are able to send their papers electronically to 
colleagues around the world in the form of LaTeX in- 
put. This is also true for other variants of TeX, although 
the astronomical publishing community has largely cen- 
tered their publishing standards on LaTeX or one of the 
software packages based on LaTeX, such as BibTeX or 
AASTeX. BibTeX is a program and file format designed 
by Oren Patashnik and Leslie Lamport in 1985 for the 
LaTeX document preparation system, and AASTeX is 
a LaTeX-based package that can be used to mark up 
manuscripts specifically for American Astronomical So- 
ciety (AAS) journals. 

Similar to the widespread acceptance of TeX and its 
variants, the extensive use of SGML (Standard Gener- 
alized Markup Language, Goldfarb & Rubinsky 1991) by 
the members of the publishing community has given us 
the ability to standardize many of our parsing routines. 
All data gleaned off the World Wide Web share fea- 
tures due to the use of HTML (Hyper Text Markup Lan- 
guage, Powell & Whitworth 1998), an example of SGML. 
Furthermore, the trend towards using XML (Extensible 
Markup Language, Harold 1999) to describe text doc- 
uments will enable us to share standard document at- 
tributes with other members of the astronomical commu- 
nity. XML is a subset of SGML which is intended to en- 
able generic SGML to be served, received, and processed 
on the Web in the way that is now possible with HTML. 
The ADS parsing routines benefit from these standards 
in several ways: we can reuse routines designed around 
these systems; we are able to preserve original text repre- 
sentations of entities such as embedded accents so these 
entities are displayed correctly in the user's browser; and 
we are able to capture value-added features such as elec- 
tronic URLs and email addresses for use elsewhere in our 
system. 
Table 2. The ADS Astronomy Database
Journal Source Full Name How Often 

See accompanying text file ADS_dataT2.txt for Table 2. 



a Letter codes describing what data are available 

b Astronomische Gesellschaft 

c University of Chicago Press 

d American Institute of Physics 

e Overseas Publishers Association 

f American Geophysical Union 

g Central Bureau for Astronomical Telegrams 

h Academic Press 

i Universidad Nacional Autonoma de Mexico

j Astronomisches Rechen-Institut



Table 3. The ADS Instrumentation Database 
Journal Source Full Name How Often 

See accompanying text file ADS_dataT3.txt for Table 3. 



a Letter codes describing what data are available 

b Optical Society of America 

c The International Society for Optical Engineering (SPIE) 

d Institute of Physics 



Table 4. The ADS Physics Database 



Journal Source Full Name How Often 



See accompanying text file ADS_dataT4.txt for Table 4. 



a Letter codes describing what data are available 



In order to facilitate data exchange between different 
parts of the ADS, we make use of a tagged format similar 
to the "Refer" format (Jacobsen 1996). Refer is a preprocessor
for the word processors nroff and troff which finds
and formats references. While our tagged formats share 
some common fields (%A, %T, %J, %D), the Refer for- 
mat is not specific enough to be used for our purposes. 
Items such as objects, URLs and copyright notices are be- 
yond the scope of the Refer syntax. Details on our tagged 
format are provided in Table 5. Reading and writing rou- 
tines for this format are shared by loading and indexing 
routines, and a number of our data sources submit ab- 
stracts to us in this format. 
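
A reader for such a tagged format might look like the Python sketch below. It is a sketch under assumptions, not the ADS reading routine: the handling of continuation lines, the date and author layouts in the sample, and the name parse_tagged are all hypothetical. Only the tag letters and the four required tags come from Table 5, and the sample title and authors are placeholders.

```python
REQUIRED_TAGS = ("R", "T", "A", "D")  # marked "required" in Table 5

SAMPLE = """\
%R 1999PASP..111..438F
%T Placeholder title for illustration
%A Author, A.; Author, B.
%D 05/1999
%B First line of the abstract
   continued on a second line.
"""


def parse_tagged(text):
    """Parse one tagged record into a dict keyed by tag letter.

    Assumes each field starts with '%' plus one tag character, and that
    lines without a leading '%' continue the previous field.
    """
    record = {}
    tag = None
    for line in text.splitlines():
        if line.startswith("%") and len(line) >= 2:
            tag = line[1]
            record[tag] = line[2:].strip()
        elif tag is not None and line.strip():
            record[tag] = (record[tag] + " " + line.strip()).strip()
    missing = [t for t in REQUIRED_TAGS if not record.get(t)]
    if missing:
        raise ValueError(f"missing required tags: {missing}")
    return record
```

Parsing SAMPLE gives a dictionary in which the "R" entry holds the bibliographic code and the "B" entry holds the abstract joined into a single line.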



4. Creating the Bibliographic Records 

One of the basic principles in the parsing and format- 
ting of the bibliographic data incorporated into the ADS 
database over the years has been to preserve as much of 
the original information as possible and delay any syntactic
or semantic interpretation of the data until a later
stage. From the implementation point of view, this means
that bibliographic records provided to the ADS by pub- 
lishers or other data sources typically are saved as files 
which are tagged with their origin, entry date, and any 
other ancillary information relevant to their contents (e.g. 
if the fields in the record contain data which was translit- 
erated or converted to ASCII). 

For instance, the records provided to the ADS by the 
University of Chicago Press (the publisher of several major 
U.S. astronomical journals) are SGML documents which 
contain a unique manuscript identifier assigned to the pa- 
per during the electronic publishing process. This identi- 
fier is saved in the file created by the ADS system for this 
bibliographic entry. 

Because data about a particular bibliographic entry 
may be provided to the ADS by different sources and at 
different times, we adopted a multi-step procedure in the 
creation and management of bibliographic records: 

1) Tokenization: Parsing input data into a memory- 
resident data structure using procedures which are format- 
and source-specific. 
Table 5. Tagged Format Definitions

Tag   Name                                    Comment

%R    Bibliographic Code                      required
%T    Title                                   required
%A    Author List                             required
%D    Publication Date                        required
%B    Abstract Text
%C    Abstract Copyright
%E    URL for Electronic Data Table
%F    Author Affiliation
%G    Origin
%H    Email
%J    Journal Name, Volume, and Page Range
%K    Keywords
%L    Last Page of Article
%O    Object Name
%Q    Category
%U    URL for Electronic Document
%V    Language
%W    Database (AST, PHY, INST)
%X    Comment
%Y    Identifiers
%Z    References



2) Identification: Computing the unique bibliographic 
record identifier used by the ADS to refer to this record. 

3) Instantiation: Creating a new record for each bibli- 
ography formatted according to the ADS "standard" for- 
mat. 

4) Extraction: Selecting the best information from the 
different records available for the same bibliography and 
merging them into a single entry avoiding duplication of 
redundant information. 
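
The extraction step can be illustrated with a short Python sketch. It assumes, purely for illustration, that "best" means "first non-empty value in a fixed order of preferred sources"; the source names, the priority order, the dictionary layout, and the function name extract are hypothetical and are not taken from the ADS code.

```python
def extract(records, priority):
    """Merge several records for the same bibliography into one entry.

    Records are visited in order of source priority; for each field the
    first non-empty value wins, so later sources only fill in gaps and
    redundant copies of a field are not duplicated.
    """
    ranked = sorted(records, key=lambda rec: priority.index(rec["origin"]))
    merged = {}
    for rec in ranked:
        for field, value in rec.items():
            if field != "origin" and value and field not in merged:
                merged[field] = value
    return merged
```

Given a publisher record with a title but no abstract and an STI record with both, and with the publisher ranked first, the merged entry keeps the publisher's title and takes the abstract from STI.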

4.1. Tokenization

The activity of parsing a (possibly) loosely-structured bib- 
liographic record is typically more of an art than a science, 
given the wide range of possible formats used by people for 
the representation and display of these records. The ADS 
uses the PERL language (Practical Extraction and Report 
Language, Wall & Schwartz 1991) for implementing most 
of the routines associated with handling the data. PERL is 
an interpreted programming language optimized for scan- 
ning and processing textual data. It was chosen over other 
programming languages because of its speed and flexibility 
in handling text strings. Features such as pattern match- 
ing and regular expression substitution greatly facilitate 
manipulating the data fields. To maximize flexibility in 
the parsing and formatting operations of different fields, 
we have written a set of PERL library modules and scripts 
capable of performing a few common tasks. Some that we 
consider worth mentioning from the methodological point 
of view are listed below. 



— Character set conversion: electronic data are often 
delivered to us in different character set encodings, 
requiring translation of the data stream into one of the
standard character sets expected by our input scripts. 
The default character set that has been used by the 
ADS until recently is "Latin-1" encoding (ISO-8859-1, 
International Organization for Standardization 1987). 
We are now in the process of converting to the use 
of Unicode characters (Unicode Consortium 1996) 
encoded in UTF-8 (UCS Transformation Format, 
8-bit form). The advantage of using Unicode is its 
universality (all character sets can be mapped to 
Unicode without loss of information). The advantage 
of adopting UTF-8 over other encodings is mainly 
the software support currently available (most of the 
modern software packages can already handle UTF-8 
internally). The adoption of Unicode and UTF-8 also
works well with our adoption of XML as the standard 
format for bibliographic data. 

— Macro and entity expansion: Several of the highly 
structured document formats in use today rely on the 
strengths of the formatting language for the specifica- 
tion of some common formatting tasks or data tokens. 
Typically this means that LaTeX documents that are 
supplied to us make use of one or more macro pack- 
ages to perform some of the formatting tasks. Simi- 
larly, SGML documents will conform to some Docu- 
ment Type Definition (DTD) provided to us by the 
publisher, and will make use of some standard set 
of SGML entities to encode the document at the re- 
quired level of abstraction. What this means for us 
is that even if most of the input data comes to us 
in one of two basic formats (TeX/LaTeX/BibTeX or 
SGML/HTML/XML), we must be able to parse a large 
number of document classes, each one defined by a dif- 
ferent and ever increasing set of specifications, be it a 
macro package or a DTD. 

— Author name formatting: Special care has been taken 
in parsing and formatting author names from a variety 
of possible input formats to the standard one used by 
the ADS. The proper handling of author names is cru- 
cial to the integrity of the data in the ADS. Without 
proper author handling, users would be unable to get 
complete listings on searches by author names, which
comprise approximately two-thirds of all searches (see 
Eichhorn et al. 2000, hereafter SEARCH). 

Since the majority of our data sources do not provide 
author names in our standard format (last name, first 
name or initial), our loading routines need to be able 
to invert author names accurately, handling cases such
as multiple word last names (Da Costa, van der Bout,
Little Marenin) and suffixes (Jr., Sr., III). Any titles in 
an author's name (Dr., Rev.) were previously omitted, 
but are now being retained in the new XML formatting 
of text files. 
The assessment of what constitutes a multiple word
last name as opposed to a middle name is non-trivial 
since some names, such as Davis, can be a first name 
(Davis Hartman), a middle name (A. G. Davis Philip), 
a last name (Robert Davis), or some combination 
(Davis S. Davis). Another example is how to determine 
when the name "Van" is a first name (Van Nguyen), a 
middle name (W. Van Dyke Dixon), or part of a last 
name (J. van Allen). Handling all of these cases cor- 
rectly requires not only familiarity with naming con- 
ventions worldwide, but an intimate familiarity with 
the names of astronomers who publish in the field. We 
are continually amassing the latter as we incorporate 
increasing amounts of data into the system, and as we 
get feedback from our users. 

— Spell checking: Since many of the historical records en- 
tered in the ADS have been generated by typesetting 
tables of contents, typographical errors can often be 
flagged in an automated way using spell-checking soft- 
ware. We have developed a PERL software driver for 
the international ispell program, a UNIX utility, which 
can be used as a spell-checking filter on all input to be 
considered textual information. A custom dictionary 
containing terms specific to astronomy and space sci- 
ences is used to increase the recognition capabilities of 
the software module. Any corrections suggested by the 
spell-checker module are reviewed by a human before 
the data are actually updated. 

— Language recognition: Extending the capability of the 
spell-checker, we have implemented a software module 
which attempts to guess the language of an input text 
buffer based on the percentage of words that it can 
recognize in one of several languages: English, Ger- 
man, French, Spanish, or Italian. This module is used 
to flag records to be entered in our database in a lan- 
guage other than English. Knowledge of the language 
of an abstract allows us to create accurate synonyms 
for those words (see ARCHITECTURE). 
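
The author name inversion described in the list above can be sketched as follows. This is a simplified illustration in Python, not the ADS routine: the particle list, the small table of known multi-word surnames, and the placement of suffixes after the last name are all assumptions. As the text notes, a production version depends on an accumulated knowledge of actual author names that no short rule set can replace.

```python
SUFFIXES = {"Jr.", "Sr.", "II", "III"}
# Lowercase particles assumed to belong to the surname that follows them.
PARTICLES = {"van", "von", "der", "den", "de", "da", "di", "la", "le"}
# Hypothetical exception table for multi-word surnames without a particle.
KNOWN_SURNAMES = {("Little", "Marenin"), ("Da", "Costa")}

def invert_author(name):
    """Turn 'First Middle Last [Suffix]' into 'Last [Suffix], First Middle'."""
    words = name.replace(",", " ").split()
    suffix = []
    while words and words[-1] in SUFFIXES:
        suffix.insert(0, words.pop())
    if len(words) < 2:
        return " ".join(words + suffix)
    start = len(words) - 1  # index at which the surname begins
    if len(words) > 2 and tuple(words[-2:]) in KNOWN_SURNAMES:
        start = len(words) - 2
    # Absorb particles directly before the surname, keeping one given name.
    while start > 1 and words[start - 1] in PARTICLES:
        start -= 1
    last = " ".join(words[start:] + suffix)
    first = " ".join(words[:start])
    return f"{last}, {first}"
```

With these rules a capitalized "Van" is treated as a given or middle name while a lowercase "van" joins the surname, which covers the three "Van" examples quoted above but would fail for an author who capitalizes the particle.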



4.2. Identification



We call identification the activity of mapping the tokens 
extracted from the parsing of a bibliographic record into 
a unique identifier. The ADS adopted the use of bibli- 
ographic codes as the identifier for bibliographic entries 
shortly after its inception, in order to facilitate communi- 
cation between the ADS and SIMBAD. The advantage of 
using bibliographic codes as unique identifiers is that they 
can most often be created in a straightforward way from 
the information given in the list of references published in 
the astronomical literature, namely the publication year, 
journal name, volume, and page numbers, and first au- 
thor's name (see section 3.1 for details). 
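
A minimal sketch of how such a code can be assembled is given below. The field widths (4-character year, 5-character journal abbreviation, 4-character volume, one qualifier character, 4-character page, and the first author's initial, padded with periods) are inferred from the sample code 1998MNRAS.295...75E quoted in section 5.2 and should be checked against section 3.1. The function name and the author name in the usage line are hypothetical.

```python
def make_bibcode(year, journal, volume, page, first_author):
    """Assemble a 19-character bibliographic code from its components.

    Assumed layout: YYYY, journal (left-justified, 5 chars), volume
    (right-justified, 4 chars), qualifier (1 char), page
    (right-justified, 4 chars), initial of the first author's last name.
    """
    code = (
        f"{year:04d}"
        + journal.ljust(5, ".")[:5]
        + str(volume).rjust(4, ".")[-4:]
        + "."  # qualifier column, left unused in this sketch
        + str(page).rjust(4, ".")[-4:]
        + first_author.strip()[0].upper()
    )
    if len(code) != 19:
        raise ValueError("malformed bibliographic code: " + code)
    return code

# "Example" stands in for a last name beginning with E.
sample = make_bibcode(1998, "MNRAS", 295, 75, "Example")
```

Non-standard publications follow additional rules (see section 3.1) that this sketch does not attempt to reproduce.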



4.3. Instantiation

"Instantiation" of a bibliographic entry consists of the 
creation of a record for it in the ADS database. The 
ADS must handle receipt of the same data from multi- 
ple sources. We have created a hierarchy of data sources 
so that we always know the preferred data source. A ref- 
erence for which we have received records from STI, the 
journal publisher, SIMBAD, and NED, for example, must 
be in the system only once with the best information from 
each source preserved. When we load a reference into the 
system, we check whether a text file already exists for that 
reference. If there is no text file, it is a new reference and 
a text file is created. If there already is a text file, we 
append the new information to the current text file, cre- 
ating a "merged" text file. This merged text file lists every 
instance of every field that we have received. 
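
The create-or-append logic described above can be sketched in a few lines. Here an in-memory dictionary stands in for the text files on disk, and the function and variable names are illustrative assumptions.

```python
def load_record(store, bibcode, source, fields):
    """Create the record if it is new, otherwise append to the merged record.

    `store` maps a bibliographic code to a list of (source, fields) entries,
    one entry per record received. Returns True if a new record was created.
    """
    is_new = bibcode not in store
    store.setdefault(bibcode, []).append((source, dict(fields)))
    return is_new
```

Nothing is discarded at this stage: every instance of every field is retained, and the choice among them is deferred to the extraction step.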

4.4. Extraction

By "extraction" of a bibliographic entry we mean the pro- 
cedure used to create a unique representation of the bibli- 
ography from the available records. This is essentially an 
activity of data fusion and unification, which removes re- 
dundancies in the bibliographic records obtained by the 
ADS and properly labels fields by their characteristics. 
The extraction algorithm has been designed with our prior 
experience as to the quality of the data to select the best 
fields from each data source, to cross-correlate the fields 
as necessary, and to create a "canonical" text file which 
contains a unique instance of each field. Since the latter 
is created through software, only one version of the text 
file must be maintained; when the merged text file is ap- 
pended, the canonical text file is automatically recreated. 

The extraction routine selects the best pieces of in- 
formation from each source and combines them into one 
reference which is more complete than the individual ref- 
erences. For example, author lists received from STI were 
often truncated after five or ten authors. Whenever we 
have a longer author list from another source, that au- 
thor list is used instead. This not only recaptures missing 
authors, it also provides full author names instead of au- 
thor initials whenever possible. In addition, our journal 
sources sometimes omit the last page number of the refer- 
ence, but SIMBAD usually includes it, so we are able to 
preserve this information in our canonical text file. 
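
The selection rules in this paragraph can be illustrated with the following sketch. The source ranking, the field names, and the function name are assumptions for the example; the actual hierarchy of data sources is not listed in the text.

```python
# Hypothetical source ranking, most trusted first.
SOURCE_PRIORITY = ["JOURNAL", "SIMBAD", "STI"]

def extract_canonical(merged):
    """Build one canonical record from a list of (source, fields) entries."""
    ranked = sorted(merged, key=lambda entry: SOURCE_PRIORITY.index(entry[0]))
    canonical = {}
    for _source, fields in ranked:
        for name, value in fields.items():
            # The highest-ranked source that supplies a field wins, so a
            # field missing from one source is filled in from another.
            canonical.setdefault(name, value)
    # Prefer the longest author list, which recovers truncated lists.
    author_lists = [f["authors"] for _s, f in ranked if f.get("authors")]
    if author_lists:
        canonical["authors"] = max(author_lists, key=len)
    return canonical
```

Because the canonical record is computed entirely from the merged one, it can be regenerated whenever a new record is appended. Preferring full names over initials for individual authors is not handled in this sketch.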

Some fields need to be labelled by their characteris- 
tics so that they are properly indexed and displayed. The 
keywords, for example, need to be attributed to a spe- 
cific keyword system. The system designation allows for 
multiple keyword sets to be displayed (e.g. NASA/STI 
Keywords and AAS Keywords) and will be used in 
the keyword synonym table currently under development 
(Lee et al. 1999). 

We also attempt to cross-correlate authors with their 
affiliations wherever possible. This is necessary for records 
where the preferred author field is from one source and
the affiliations are from another source. We attempt to 
assign the proper affiliation based on the last name and 
do not assume that the author order is accurate since we 
are aware of ordering discrepancies in some of the STI
records. 
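
Matching affiliations by last name rather than by position might look like the sketch below. The "Last, First" input layout, the case-insensitive comparison, and the handling of repeated last names (affiliations are consumed in the order received) are assumptions.

```python
def match_affiliations(authors, affiliated):
    """Assign affiliations to authors by last name, ignoring list order.

    authors: ['Last, First', ...] from the preferred source.
    affiliated: [('Last, F.', 'affiliation'), ...] from another source.
    Returns one affiliation per author, or None where no match is found.
    """
    by_last = {}
    for name, affiliation in affiliated:
        last = name.split(",")[0].strip().lower()
        by_last.setdefault(last, []).append(affiliation)
    result = []
    for name in authors:
        last = name.split(",")[0].strip().lower()
        candidates = by_last.get(last, [])
        result.append(candidates.pop(0) if candidates else None)
    return result
```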

Through these four steps in the procedure of creating 
and managing bibliographic records, we are able to take 
advantage of receiving the same reference from multiple 
sources. We standardize the various records and present 
to the user a combination of the most reliable fields from 
each data source in one succinct text file. 

5. Updating the Database 

The software to update bibliographic records in the 
database consists of a series of PERL scripts, typically 
one per data source, which reads in the data, performs 
any special processing particular to that data source, and 
writes out the data to text files. The loading routines perform
three fundamental tasks: 1) they add new bibliographic
codes to the current master list of bibliographic
codes in the system; 2) they create and organize the text 
files containing the reference data; and 3) they maintain 
the lists of bibliographic codes used to indicate what items 
are available for a given reference. 

5.1. The Master List 

The master list is a table containing bibliographic codes 
together with their publication dates (YYYYMM) and en- 
try dates into the system (YYYYMMDD). There is one 
master list per database with one line per reference. The 
most important aspect of the master list is that it re- 
tains information about "alternative" bibliographic codes 
and matches them to their corresponding preferred biblio- 
graphic code. An alternative bibliographic code is usually 
a reference which we receive from another source (primarily
SIMBAD or NED) which has been assigned a different
bibliographic code from the one used by the ADS. Some- 
times this is due to the different rules used to build bibli- 
ographic codes for non-standard publications (see section 
3.1), but often it is just an incorrect year, volume, page, or 
author initial in one of the databases (SIMBAD or NED
or the ADS). In either case, the ADS must keep the al- 
ternative bibliographic code in the system so that it can 
be found when referenced by the other source (e.g. when 
SIMBAD sends back a list of their codes related to an 
object). The ADS matches the alternative bibliographic 
code to our corresponding one and replaces any instances 
of the alternative code when referenced by the other data 
source. Alternative bibliographic codes in the master list 
are prepended with an identification letter (S for SIM- 
BAD, N for NED, J for Journal) so that their origin is 
retained. 



While we make every effort to propagate corrections 
back to our data sources, sometimes there is simply a 
valid discrepancy. For example, alternative bibliographic 
codes are often different from the ADS bibliographic code 
due to ambiguous differences such as which name is the 
surname of a Chinese author. Since Americans tend to 
invert Chinese names one way (Zheng, Wei) and Euro- 
peans another (Wei, Zheng), this results in two different, 
but equally valid codes. Similarly, discrepancies in journal 
names such as BAAS (for the published abstracts in the 
Bulletin of the American Astronomical Society) and AAS 
(for the equivalent abstract with meeting and session num- 
ber, but no volume or page number) need different codes 
to refer to the same paper. Russian and Chinese translation
journals (Astronomicheskii Zhurnal vs. Soviet Astronomy
and Acta Astronomica Sinica vs. Chinese Astronomy
and Astrophysics) share the same problem. These papers 
appear once in the foreign journal and once in the trans- 
lation journal (usually with different page numbers), but 
are actually the same paper which should be in the sys- 
tem only once. The ADS must therefore maintain multiple 
bibliographic codes for the same article since each journal 
has its own abbreviation, and queries for either one must 
be able to be recognized. The master list is the source of 
this correlation and enables the indexing procedures and 
search engine to recognize alternative bibliographic codes. 
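
The role of the master list in resolving alternative codes can be sketched as a pair of lookup tables. The class and method names are illustrative, and the sketch stores the origin letter beside the alternative code, whereas the master list itself prepends the letter to the code. The sample codes in the usage lines are invented for the example.

```python
class MasterList:
    """Maps preferred and alternative bibliographic codes."""

    ORIGINS = ("S", "N", "J")  # SIMBAD, NED, Journal

    def __init__(self):
        self.entries = {}     # preferred code -> (pubdate, entrydate)
        self.alternates = {}  # alternative code -> (origin, preferred code)

    def add(self, bibcode, pubdate, entrydate):
        """Register a preferred code with YYYYMM and YYYYMMDD dates."""
        self.entries[bibcode] = (pubdate, entrydate)

    def add_alternate(self, origin, alt_code, preferred):
        if origin not in self.ORIGINS:
            raise ValueError("unknown origin letter: " + origin)
        if preferred not in self.entries:
            raise KeyError(preferred)
        self.alternates[alt_code] = (origin, preferred)

    def resolve(self, bibcode):
        """Return the preferred code for any known code, else None."""
        if bibcode in self.entries:
            return bibcode
        if bibcode in self.alternates:
            return self.alternates[bibcode][1]
        return None

# Invented codes differing only in the author initial (Zheng vs. Wei).
master = MasterList()
master.add("1999ApJ...500..100Z", "199901", "19990215")
master.add_alternate("S", "1999ApJ...500..100W", "1999ApJ...500..100Z")
preferred = master.resolve("1999ApJ...500..100W")
```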

5.2. The Text Files 

Text files in the ADS are stored in a directory tree by 
bibliographic code. The top level of directories is divided 
into directories with four-digit names by publication year 
(characters 1 through 4 of the bibliographic code). The 
next level contains directories with five-character names 
according to journal (characters 5 through 9), and the 
text files are named by full bibliographic code under
these journal directories. Thus, a sample pathname is 
1998/MNRAS/1998MNRAS.295...75E. Alternative bibli- 
ographic codes do not have a text file named by that code, 
since the translation to the equivalent preferred biblio- 
graphic code is done prior to accessing the text file. 
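
The directory layout just described maps directly onto string slices of the bibliographic code, as in this short sketch (the function name is an assumption):

```python
import posixpath

def text_file_path(bibcode):
    """Return the relative path of the text file for a preferred code."""
    year = bibcode[0:4]     # characters 1 through 4
    journal = bibcode[4:9]  # characters 5 through 9
    return posixpath.join(year, journal, bibcode)

path = text_file_path("1998MNRAS.295...75E")
# path == "1998/MNRAS/1998MNRAS.295...75E"
```

For journal abbreviations shorter than five characters the slice includes the padding periods; whether the actual directory names keep that padding is not stated in the text.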

A sample text file is given in the appendices. Appendix 
B shows the full bibliographic entry, including all records 
as received from STI, MNRAS, and SIMBAD. It contains 
XML-tagged fields from each source, showing all instances 
of every field. Appendix C shows the extracted canonical 
version of the bibliographic entry which contains only se- 
lected information from the merged text file. This latter 
version is displayed to the user through the user interface 
(see SEARCH).

5.3. The Codes Files 

The third basic function of the loading procedures is to 
modify and maintain the listings for available items. The 
ADS displays the availability of resources or information
related to bibliographic entries as letter codes in the re- 
sults list of queries and as more descriptive hyperlinks in 
the page displaying the full information available for a bib- 
liographic entry. A full listing of the available item codes 
and their meaning is given in SEARCH. 

The loading routines maintain lists of bibliographic 
codes for each letter code in the system which are con- 
verted to URLs by the indexing routines (see ARCHITEC- 
TURE). Bibliographic codes are appended to the lists ei- 
ther during the loading process or as post-processing work 
depending on the availability of the resource. When elec- 
tronic availability of data coincides with our receipt of the 
data, the bibliographic codes can be appended to the lists 
by the loading procedures. When we receive the data prior 
to electronic availability, post-processing routines must be 
run to update the bibliographic code lists after we are no- 
tified that we may activate the links. 

6. The Articles 

The ADS is able to scan and provide free access to past 
issues of the astronomical journals because of the willing 
collaboration of the journal publishers. The primary rea- 
son that the journal publishers have agreed to allow the 
scanning of their old volumes is that the loss of individ- 
ual subscriptions does not pose a threat to their liveli- 
hood. Unlike many disciplines, most astronomy journals 
are able to pay for their publications through the cost 
of page charges to astronomers who write the articles and 
through library subscriptions which are unlikely to be can- 
celled in spite of free access to older volumes through the 
ADS. The journal publishers continue to charge for access 
to the current volumes, which is paid for by most institu- 
tional libraries. This arrangement places astronomers in a 
fortunate position for electronic accessibility of astronomy 
articles. 

The original electronic publishing plans for 
the astronomical community called for STELAR 
(STudy of Electronic Literature for Astronomical Research,
van Steenberg 1992, van Steenberg et al. 1992,
Warnock et al. 1992, Warnock et al. 1993) to handle the 
scanning and dissemination of the full journal articles. 
However, when the STELAR project was terminated 
in 1993, the ADS assumed responsibility for providing 
scanned full journal articles to the astronomical community.
The first test journal to be scanned was the ApJ
Letters, which was scanned in January 1995 at 300 dots
per inch (dpi). Those scans were intended to be 600 dpi,
and we will soon rescan them at that higher resolution.
Complications in the
journal publishing format (plates at the end of some 
volumes and in the middle of others) were noted and 
detailed instructions provided to the scanning company 
so that the resulting scans would be named properly by 
page or plate number. 



All of the scans since the original test batch have been 
scanned at 600 dpi using a high speed scanner and gener- 
ating a 1 bit/pixel monochrome image for each page. The 
files created are then automatically processed in order to 
de-skew and center the text in each page, resize images 
to a standard U.S. Letter size (8.5 x 11 inches), and add 
a copyright notice at the bottom of each page. For each 
original scanned page, two separate image files of different 
resolutions are generated and stored on disk. The avail- 
ability of different resolutions allows users the flexibility 
of downloading either high or medium quality documents, 
depending on the speed of their internet connection. The 
image formats and compression used were chosen based 
on the available compression algorithms and browser ca- 
pabilities. The high resolution files currently used are 600 
dpi, 1 bit/pixel TIFF (Tagged Image File Format) files, 
compressed using the CCITT Group 4 facsimile encod- 
ing algorithm. The medium resolution files are 200 dpi, 1 
bit/pixel TIFF files, also with CCITT Group 4 facsimile 
compression. 
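
For a sense of scale, the uncompressed size of each bitmap follows directly from the page dimensions and resolution given above. The helper below is an illustrative calculation only; the sizes of the compressed files actually stored depend on page content and are not derived here.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Raw bitmap size of one page before any compression is applied."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

high = uncompressed_bytes(8.5, 11, 600)    # 5100 x 6600 pixels
medium = uncompressed_bytes(8.5, 11, 200)  # 1700 x 2200 pixels
# high == 4207500 bytes, medium == 467500 bytes
```

A 600 dpi page is thus about 4.2 million bytes before compression, nine times the size of its 200 dpi counterpart.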

Conversion to printing formats (PDF, PCL, and 
Postscript) is done on demand, as requested by the user. 
Similarly, conversion from the TIFF files to a low reso- 
lution GIF (Graphic Interchange Format) file (75, 100, or 
150 dpi, depending on user preferences) for viewing on the 
computer screen is done on demand, then cached so that 
the most frequently accessed pages do not need to be cre- 
ated every time. A procedure run nightly deletes the GIF 
files with the oldest access time stamp so that the total 
size of the disk cache is kept under a pre-defined limit. The 
current 10 GBytes of cache size in use at the SAO Article 
Server causes only files which have not been accessed for 
about a month to be deleted. Like the full-screen GIF im- 
ages, the ADS also caches thumbnail images of the article 
pages which provide users with the capability of viewing 
the entire article at a glance. 
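
The nightly clean-up amounts to deleting files in order of increasing access time until the cache fits within its limit. A minimal sketch follows; the function name and the tuple layout are assumptions, and the file metadata is passed in rather than read from disk.

```python
def files_to_delete(files, limit_bytes):
    """Select cached files to remove, least recently accessed first.

    files: iterable of (path, size_bytes, last_access_time) tuples.
    Returns the paths to delete so that the remaining total size is
    at or below limit_bytes.
    """
    files = list(files)
    total = sum(size for _path, size, _atime in files)
    doomed = []
    for path, size, _atime in sorted(files, key=lambda f: f[2]):
        if total <= limit_bytes:
            break
        doomed.append(path)
        total -= size
    return doomed
```

With a fixed size limit, the age at which files are evicted is not set directly but follows from the usage pattern, which is why the text quotes it as "about a month" for the current cache size.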

The ADS uses Optical Character Recognition (OCR) 
software to gain additional data from TIFF files of article 
scans. The OCR software is not yet adequate for accu- 
rate reproduction of the scanned pages. Greek symbols, 
equations, charts, and tables do not translate accurately 
enough to remain true to the original printed page. For 
this reason, we have chosen not to display to the user 
anything rendered by the OCR software in an unsuper- 
vised fashion. However, we are still able to take advantage 
of the OCR software for several purposes. 

First, we are able to identify and extract the abstract 
paragraph(s) for use when we do not have the abstract 
from another source. In these cases, the OCR'd text is in- 
dexed so that it is searchable and the extracted image of 
the abstract paragraph is displayed in lieu of an ASCII 
version of the abstract. Extracting the abstract from the 
scanned pages is somewhat tedious, as it requires estab- 
lishing different sets of parameters for each journal, as well 
as for different fonts used over the years by the same jour- 
nal. The OCR software can be taught how to determine 
where the abstract ends, but it does not work for every
article due to oddities such as author lists which extend 
beyond the first page of an article, and articles which are in 
a different format from others in the same volume (e.g. no 
keywords or multiple columns). The ADS currently con- 
tains approximately 25,000 of these abstract images and 
more will be added as we continue to scan the historical 
literature. 

We are also currently using the OCR software to ren- 
der electronic versions of the entire scanned articles for 
indexing purposes. We will not use this for display to the 
users, but hope to be able to index it to provide the pos- 
sibility of full text searching at some future date. We esti- 
mate that the indexing of our almost one million scanned 
pages with our current hardware and software will take 
approximately two years of dedicated CPU time. 

The last benefit that we gain from the OCR software 
is the conversion of the reference list at the end of articles. 
We use parsed reference lists from the scanned articles to 
build citation and reference lists for display through the 
C and R links of the available items. Since reference lists 
are typically in one of several standard formats, we parse 
each reference for author, journal, volume and page num- 
ber for most journal articles, and conference name, au- 
thor, and page number for many conference proceedings. 
This enables us to build bibliographic code lists for refer- 
ences contained in that article (R links) and invert these 
lists to build bibliographic code lists of articles which cite 
this paper (C links). We are able to use this process to 
identify and therefore add commonly-cited articles which
are currently missing from the ADS. These are usually papers
published prior to 1975 or astronomy-related articles published in
non-astronomy journals.
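
Building the C links from the R links is an inversion of a mapping, which can be sketched as follows (the function name and the input layout are assumptions):

```python
from collections import defaultdict

def build_citations(references):
    """Invert reference lists into citation lists.

    references: {citing code: [referenced codes]} (the R links).
    Returns {cited code: sorted list of citing codes} (the C links).
    """
    citations = defaultdict(set)
    for citing, refs in references.items():
        for cited in refs:
            citations[cited].add(citing)
    return {cited: sorted(citing) for cited, citing in citations.items()}
```

Codes that appear as keys of the result but are not yet in the master list are exactly the commonly cited but missing articles mentioned above.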

The Article Service currently contains 250 GBytes of 
scans, which consists of 1,128,955 article pages comprising 
138,789 articles. These numbers increase on a regular ba- 
sis, both as we add more articles from the older literature 
and as we scan new journals. 

7. ADS/Journal Interaction

A description of the data in the ADS would be incom- 
plete without a discussion of the interaction between the 
ADS and the electronic journals. The data available on- 
line from the journal publishers is an extension of the data 
in the ADS and vice versa. This interaction is greatly fa- 
cilitated by the acceptance of the bibliographic code by 
many journal publishers as a means for accessing their 
on-line articles. 

Access to articles currently on-line at the journal sites 
through the ADS comprises a significant percentage of the
on-line journal access (see OVERVIEW). The best model for
interaction between the ADS and a journal publisher is the 
University of Chicago Press (hereafter UCP), publisher of 
ApJ, ApJL, ApJS, AJ, and PASP. When a new volume 



appears on-line at UCP, the ADS is notified by email and 
an SGML header file for each of those articles is simulta- 
neously transferred to our site. The data are parsed and 
loaded into the system and appropriate links are created. 
However, prior to this, the UCP has made use of the ADS 
to build their electronic version through the use of our 
bibliographic code reference resolver. 

Our bibliographic code reference resolver 
(Accomazzi et al. 1999) was developed to provide 
the capability to automatically parse, identify, and 
verify citations appearing in astronomical literature. 
By verifying the existence of a reference through the 
ADS, journals and conference proceedings editors are 
able to publish documents containing hyperlinks pointing 
to stable, unique URLs. An increasing number of journals are
linking to the ADS in their reference sections, providing 
users with the ability to read referenced articles with the 
click of a mouse button. 

During the copy editing phase, UCP editors query the 
ADS reference resolver and determine if each reference 
exactly matches a bibliographic code in the ADS. If there 
is a match, a link to the ADS is established for this en- 
try in their reference section. If there is not a match, one 
of several scenarios takes place. First, if it is a valid ref- 
erence not yet included in the ADS (most often the case 
for "fringe" articles, those peripherally associated with astronomy),
our reference resolver captures the information
necessary to add it to our database during the next update.
Second, if it is a valid reference unable to be parsed
by the resolver (sometimes the case for conference proceedings
or PhD theses), no action is taken and no link is
listed in the reference section. Third, if there is an error 
in the reference as determined by the reference resolver, 
the UCP editors may ask for a correction or clarification 
from the authors. 
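
The outcomes of a resolver query can be summarized as a small decision function. This is a schematic sketch, not the resolver itself: the function name, the outcome labels, and the use of a set of journals with complete ADS coverage to distinguish a probable error from a missing entry are assumptions for illustration.

```python
def classify_reference(parsed, known_bibcodes, complete_journals):
    """Classify one reference from a manuscript's reference section.

    parsed: None if the reference could not be parsed, otherwise a
    dictionary with the keys 'bibcode' and 'journal'.
    """
    if parsed is None:
        return "unparsed"        # no action taken, no link listed
    if parsed["bibcode"] in known_bibcodes:
        return "matched"         # a link to the ADS entry is established
    if parsed["journal"] in complete_journals:
        return "probable_error"  # journal is fully covered: query the authors
    return "candidate"           # valid but missing: add at the next update
```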

The last option demonstrates the power of the ref- 
erence resolver, which has been taught on a journal-by- 
journal basis how complete the coverage of that journal 
is in the ADS. Before the implementation of the reference 
resolver, UCP was able to match 72% of references in ApJ 
articles (E. Owens, private communication). Early results 
from the use of the reference resolver show that we are 
now able to match conference proceedings, so this num- 
ber should become somewhat larger. It is unlikely that we 
will ever match more than 90% of references in an article 
due to references such as "private communication", "in
press", and preprints, as well as author errors (see section 
8). Our own reference resolving of OCR'd reference lists 
shows that we can match approximately 86 

The ADS provides multiple ways for authors and jour- 
nal publishers to link to the ADS (see SEARCH). We make 
every effort to facilitate individuals and organizations link- 
ing to us. This is easily done for simple searches such as the 
verification of a bibliographic code or an author search for 
a single spelling. However, given the complexity of the system,
these automated searches can quickly become complicated.
Details for conference proceedings editors or journal
publishers who are interested in establishing or improving 
links to the ADS are available upon request. In particu- 
lar, those who have individual TeX macros incorporated 
in their references can use our bibliographic code resolver 
to facilitate linking to the ADS. 

8. Discussion and Summary 

As of this writing (12/1999), there are 524,304 references 
in the Astronomy database, 523,498 references in the In- 
strumentation database, 443,858 references in the Physics 
database, and 3467 references in the Preprint database, 
for a total of almost 1.5 million references in the system. 
Astronomers currently write approximately 18,000 jour- 
nal articles annually, and possibly that many additional 
conference proceedings papers per year. More than half of 
the journal papers appear in peer-reviewed journals. These 
numbers are more than double what they were in 1975, in 
spite of an increase in the number of words per page in 
most of the major journals (Abt 1995), and an increase in 
number of pages per article (Schulman et al. 1997). At the 
current rate of publication, astronomers could be writing 
25,000 journal papers per year by 2001 and an additional 
20,000 conference proceedings papers. Figure 1 shows the 
total number of papers for each year in the Astronomy 
database since 1975, divided into refereed journal papers, 
non-refereed journal papers, and conferences (including reports
and theses). There are three features worth noting.
First, the increase in total references in 1980 is due to the 
inclusion of Helen Knudsen's Monthly Astronomy and As- 
trophysics Index, a rich source of data for both journals 
and conference proceedings which began coverage in late 
1979 and continued until 1995. Second, the recent increase 
in conferences included in the Astronomy database (start- 
ing around 1996) is due to the inclusion of conference 
proceedings tables of contents provided by collaborating
librarians and typed in by our contractors. Last, the de- 
crease in numbers for 1999 is due to coverage for that year 
not yet being complete in the ADS. 

The growth rate of the Instrumentation and Physics 
databases is difficult to estimate, primarily because we do 
not have datasets which are as complete as astronomy. In 
any case, the need for the organization and maintenance 
of this large volume of data is clearly important to every 
research astronomer. Fortunately, the ADS was designed 
to be able to handle this large quantity of data and to 
be able to grow with new kinds of data. New available 
item links have been added for new types of data as they 
became available (e.g. the links to complete book entries 
at the Library of Congress) and future datasets (e.g. from 
future space missions) should be able to be added in the 
same fashion. 

Fig. 1. Histogram showing the number of refereed journal papers,
non-refereed journal papers, and conferences (including
reports and theses) for each year in the Astronomy database
since 1975.

As with any dataset of this magnitude, there is some
fraction of references in the system which are incorrect.
This is unavoidable given the large number of data
sources, errors in indices and tables of contents as originally
published, and human error. In addition, many authors
do not give full attention to verifying all references
in a paper, resulting in the introduction of errors in many 
places. In a systematic study of more than 1000 references 
contained in a single issue of the Astrophysical Journal, 
Abt (1992) found that more than 12% of those contained 
errors. This number should be significantly reduced with 
the integration of the ADS reference resolver in the elec- 
tronic publishing process. However, any mistakes in the 
ADS can and will get propagated, so steps are being taken 
by us to maximize accuracy of our entries. 

Locating and identifying correlations between multi- 
ple bibliographic codes which describe the same article is 
a time-consuming and sometimes subjective task as many 
pairs of bibliographic codes need to be verified by manu- 
ally looking up papers in the library. We use the Abstract 
Service itself for gross matching of bibliographic codes, 
submitting a search with author and title, and consider- 
ing any resulting matches with a score of 1.0 as a potential 
match. These matches are only potential matches which 
require verification since authors can submit the same pa- 
per to more than one publication source (e.g. BAAS and 
a refereed journal), and since errata published with the 
same title and author list will perfectly match the original 
paper. 

When a volume or year is mismatched, it is usually 
obvious which of a pair of matched bibliographic codes 
is correct, but if a page number is off, the decision as to 
which code is correct cannot always be automated. We 
also need to consider matches with very high scores less 
than 1.0 since these are the matches where an author name 
may be incorrect. The correction of errors of this sort is 
ongoing work which is carried out as often as time and 
resources permit. 
The evolution of the Internet and the World Wide
Web, along with the explosion of astronomical services 
on the Web has enabled the ADS to provide access to our 
databases in an open and uniform environment. We have 
been able to hyperlink both to our own resources and to 
other on-line resources such as the journal bibliographies 
(Boyce & Biemesderfer 1996). As part of the international 
collaboration Urania (Universal Research Archive of Net- 
worked Information in Astronomy, Boyce 1998), the ADS 
enables a fully functioning distributed digital library of as- 
tronomical information which provides power and utility 
previously unavailable to the researcher. 

Perhaps the largest factor which has contributed to the 
success of the ADS is the willing cooperation of the AAS, 
CDS, and all the journal publishers. The ADS has largely 
become the means for linking together smaller pieces of 
a bigger picture, making an elaborate digital library for 
astronomers a reality. We currently collaborate with over 
fifty groups in creating and maintaining cross-links among 
data centers. These additional collaborations with individ- 
uals and institutions worldwide allow us to provide many 
value-added features to the system such as object informa- 
tion, author email addresses, mail order forms for articles, 
citations, article scans, and more. A listing of these
collaborations is provided in Table 6. Any omissions from
this table are unintentional: the ADS values all of its
collaborators, and users benefit from the minor
collaborations as much as from the major ones, since the
minor ones are often harder for users to discover
independently. Most of the abbreviations are listed in
Tables 2, 3, and 4.

The successful coordination of data exchanges with 
each of our collaborators and the efforts which went into 
establishing them in the first place have been key to the 
success of the ADS. Establishing links to and from the
journal publishers, changing these links due to revisions
at publisher websites, and tracking and fixing broken links
are all considered routine data maintenance for the system.
Since it is necessary for us to maintain connectivity to
external sites, routine checks of sample links are performed
on a regular basis to verify that the links are still active.
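
A routine check of this kind might look like the following sketch. It is an assumption-laden illustration rather than a description of the ADS tooling: the function names `is_alive` and `check_sample`, the use of HTTP HEAD requests, and the random sampling strategy are all choices made here for the example.

```python
import http.client
import random
import urllib.request


def is_alive(url, timeout=10.0):
    """Return True if the URL answers an HTTP HEAD request with a non-error status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # URLError/HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def check_sample(links, sample_size, probe=is_alive, seed=None):
    """Probe a random sample of links and return the ones that appear broken.

    `probe` is injectable so the sampling logic can be exercised without network access.
    """
    rng = random.Random(seed)
    sample = rng.sample(links, min(sample_size, len(links)))
    return [url for url in sample if not probe(url)]
```

Sampling keeps the load on external sites low; any link reported here would still need a second look, since a single failed request may only reflect a temporary outage.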

Usage statistics for the Abstract Service (see 
OVERVIEW) indicate that astronomers and librarians at 
scientific institutions are eager to take advantage of the 
information that the ADS provides. The widespread ac- 
ceptance of the ADS by the astronomical community is 
changing how astronomers do research, placing extensive 
bibliographic information at their fingertips. This enables 
researchers to increase their productivity and to improve 
the quality of their work. 

A number of improvements to the data in the ADS are 
planned for the near future. As always, we will continue 
our efforts to increase the completeness of coverage, par- 
ticularly for the data prior to 1975. We have collected most 
of the major journals back to the first issue for scanning 
and adding to the Astronomy database. In addition, we
are scanning and OCR'ing tables of contents for conference
proceedings to improve our coverage in that area. We are 
currently OCR'ing full journal articles to provide full text 
searching and to improve the completeness of our reference
and citation coverage. Finally, as the ADS becomes
commonplace for all astronomers, valuable feedback from
our users about missing papers, errors in the database,
and suggested improvements to the system serves to guide
the future of the ADS and to ensure that it continues to
evolve into a more valuable research tool for the
scientific community.

Appendix A:

Version 1.0 of the XML DTD describing text files in the 
ADS Abstract Service. 

Document Type Definition for the ADS 
bibliographic records 



Syntax policy 



- The element names are in uppercase in order 
to help the reading. 

- The attribute names are preferably in 
lowercase 

- The attribute values are allowed to be of 
type CDATA to allow more flexibility for 
additional values; however, attributes 
typically may only assume one of a well- 
defined set of values 

- Cross-referencing among elements such as 
AU, AF, and EM is accomplished through the 
use of attributes of type IDREFS (for AU) 
and ID (for AF and EM) 



<!-- BIBRECORD is the root element of the XML
     document. Attributes are:

     origin  mnemonic indicating individual(s)
             or institution(s) who submitted
             the record to ADS

     lang    language in which the contents of
             this record are expressed; the
             possible values are language tags
             as defined in RFC 1766.
             Examples: lang="fr", lang="en"
-->
<!ELEMENT BIBRECORD ( METADATA?,
                      TITLE?,
                      AUTHORS?,
                      AFFILIATIONS?,
                      EMAILS?,
                      FOOTNOTES?,
                      BIBCODE,
                      MSTRING,
                      MONOGRAPH?,
                      SERIES?,
                      PAGE?,
                      LPAGE?,
                      COPYRIGHT?,
                      PUBDATE,
                      CATEGORIES*,
                      COMMENTS*,
                      ANOTE?,
                      BIBTYPE?,
                      IDENTIFIERS?,
                      ORIGINS,
                      OBJECTS*,
                      KEYWORDS*,
                      ABSTRACT* ) >
<!ATTLIST BIBRECORD origin CDATA #REQUIRED
                    lang   CDATA #IMPLIED >

<!-- Generic metadata about the ADS record
     (rather than the publication) -->
<!ELEMENT METADATA ( VERSION,
                     CREATOR,
                     CDATE,
                     EDATE ) >

<!-- Versioning is introduced to allow parsers
     to detect and reject any documents not
     complying with the supported DTD -->
<!ELEMENT VERSION ( #PCDATA ) >

<!-- CREATOR is purely informative -->
<!ELEMENT CREATOR ( #PCDATA ) >

<!-- Creation date for the record -->
<!ELEMENT CDATE ( YYYY-MM-DD ) >

<!-- Last modified date -->
<!ELEMENT EDATE ( YYYY-MM-DD ) >

<!-- Title of the publication -->
<!ELEMENT TITLE ( #PCDATA ) >
<!ATTLIST TITLE lang CDATA #IMPLIED >



<!-- AUTHORS contains only AU subelements, each
     one of them corresponding to a single author
     name -->
<!ELEMENT AUTHORS ( AU+ ) >

<!-- AU contains at least the person's last name
     (LNAME), and possibly the first and middle name(s)
     (or just the initials) which would be stored in
     element FNAME. PREF and SUFF represent the
     salutation and suffix for the name. SUFF
     typically is one of: Jr., Sr., II, III, IV.
     PREF is rarely used but is here for completeness.
     Typically we would store salutations such as
     "Rev." (for "Reverend"), or "Prof." (for
     "Professor") in this element.
-->
<!ELEMENT AU ( PREF?,
               FNAME?,
               LNAME,
               SUFF? ) >

<!-- The attributes AF and EM are used to cross
     reference author affiliations and email
     addresses with the individual author records.
     This is the only exception of attributes in
     upper case. The typical use of this is:
     <AU AF="AF_1 AF_2" EM="EM_3">...</AU>
-->
<!ATTLIST AU AF IDREFS #IMPLIED
             EM IDREFS #IMPLIED
             FN IDREFS #IMPLIED >

<!-- AU subelements -->
<!ELEMENT PREF ( #PCDATA ) >
<!ELEMENT FNAME ( #PCDATA ) >
<!ELEMENT LNAME ( #PCDATA ) >
<!ELEMENT SUFF ( #PCDATA ) >

<!-- AFFILIATIONS is the wrapper element for
     the individual affiliation records, each
     represented as an AF element -->
<!ELEMENT AFFILIATIONS ( AF+ ) >
<!ELEMENT AF ( #PCDATA ) >

<!-- the value of the ident attribute should
     match one of the values assumed by the AF
     attribute in an AU element -->
<!ATTLIST AF ident ID #REQUIRED >

<!ELEMENT EMAILS ( EM+ ) >
<!ELEMENT EM ( #PCDATA ) >

<!-- the value of the ident attribute should
     match one of the values assumed by the EM
     attribute in an AU element -->
<!ATTLIST EM ident ID #REQUIRED >

<!-- FOOTNOTES and FN subelements are here for
     future use -->
<!ELEMENT FOOTNOTES ( FN+ ) >
<!ELEMENT FN ( #PCDATA ) >
<!ATTLIST FN ident ID #REQUIRED >

<!-- BIBCODE; for a definition, see:
     http://adsdoc.harvard.edu/abs_doc/bib_help.html
     http://adsabs.harvard.edu/cgi-bin/
         nph-bib_query?1995ioda.book..259S
     http://adsabs.harvard.edu/cgi-bin/
         nph-bib_query?1995VA.....39R.272S
     This identifier logically belongs to the
     IDENTS element, but since it is the
     identifier used internally in the system,
     it is important to have it in a prominent
     and easy to reach place.
-->
<!ELEMENT BIBCODE ( #PCDATA ) >

<!-- MSTRING is the unformatted string for the
     monograph (article, book, whatever). Example:
     <MSTRING>The Astrophysical Journal, Vol. 526,
     n. 2, pp. L89-L92</MSTRING>
-->
<!ELEMENT MSTRING ( #PCDATA ) >

<!-- MONOGRAPH is a structured record containing
     the fielded information about the monograph
     where the bibliographic entry appeared.
     Typically this is created by parsing the
     text in the MSTRING element. Example:
     <MTITLE>The Astrophysical Journal</MTITLE>
     <VOLUME>526</VOLUME>
     <ISSUE>2</ISSUE>
     <PUBLISHER>University of Chicago Press
     </PUBLISHER>
-->
<!ELEMENT MONOGRAPH ( MTITLE,
                      VOLUME?,
                      ISSUE?,
                      MNOTE?,
                      EDITORS?,
                      EDITION?,
                      PUBLISHER?,
                      LOCATION?,
                      MID* ) >

<!-- Monograph title (e.g. "Astrophysical Journal") -->
<!ELEMENT MTITLE ( #PCDATA ) >
<!ELEMENT VOLUME ( #PCDATA ) >
<!ATTLIST VOLUME type NMTOKEN #IMPLIED >
<!ELEMENT ISSUE ( #PCDATA ) >

<!-- A note about the monograph as supplied by the
     publisher or editor -->
<!ELEMENT MNOTE ( #PCDATA ) >

<!-- List of editor names as extracted from MSTRING.
     Formatting is as for AUTHORS and AU elements -->
<!ELEMENT EDITORS ( ED+ ) >
<!ELEMENT ED ( PREF?,
               FNAME?,
               LNAME,
               SUFF? ) >

<!-- Edition of publication -->
<!ELEMENT EDITION ( #PCDATA ) >

<!-- Name of publisher -->
<!ELEMENT PUBLISHER ( #PCDATA ) >

<!-- Place of publication -->
<!ELEMENT LOCATION ( #PCDATA ) >

<!-- MID represents the monograph identification as
     supplied by the publisher. This may be useful in
     correlating our record with the publisher's online
     offerings. The "system" attribute characterizes
     the system used to express the identifier -->
<!ELEMENT MID ( #PCDATA ) >
<!ATTLIST MID type NMTOKEN #IMPLIED >

<!-- If the bibliographic entry appeared in a series,
     then the element SERIES contains information
     about the series itself. Typically this consists
     of data about a conference series (e.g. ASP
     Conference Series). Note that there may be
     several SERIES elements, since some
     publications belong to "subseries" within
     a series.
-->
<!ELEMENT SERIES ( SERTITLE,
                   SERVOL?,
                   SEREDITORS?,
                   SERBIBCODE? ) >

<!-- Title, volume, and editors of conference
     series -->
<!ELEMENT SERTITLE ( #PCDATA ) >
<!ELEMENT SERVOL ( #PCDATA ) >
<!ELEMENT SEREDITORS ( ED+ ) >

<!-- Serial bibcode for publication (may coincide
     with main bibcode) -->
<!ELEMENT SERBIBCODE ( #PCDATA ) >

<!-- PAGE may have the attribute type set to
     "s" (for sequential) if the value associated
     to it does not represent a printed volume
     number -->
<!ELEMENT PAGE ( #PCDATA ) >
<!ATTLIST PAGE type NMTOKEN #IMPLIED >

<!-- LPAGE gives the last page number (if known).
     Does not make sense if PAGE is type="s" -->
<!ELEMENT LPAGE ( #PCDATA ) >

<!-- COPYRIGHT is just an unformatted string
     containing copyright information from
     publisher -->
<!ELEMENT COPYRIGHT ( #PCDATA ) >

<!ELEMENT PUBDATE ( YEAR, MONTH? ) >
<!ELEMENT MONTH ( #PCDATA ) >
<!ELEMENT YEAR ( #PCDATA ) >

<!-- CATEGORIES contain subelements indicating in
     which subject categories the publication was
     assigned. STI/RECON has always assigned a
     category for each entry in their system, but
     otherwise there is little else in our
     database. The attributes origin and system
     are used to keep track of the different
     classifications used.
-->
<!ELEMENT CATEGORIES ( CA+ ) >
<!ATTLIST CATEGORIES origin NMTOKEN #IMPLIED
                     system NMTOKEN #IMPLIED >
<!ELEMENT CA ( #PCDATA ) >

<!-- Typically private fields supplied by the
     data source. For instance, SIMBAD and LOC
     provide comments about bibliographic
     entries -->
<!ELEMENT COMMENTS ( CO+ ) >
<!ATTLIST COMMENTS lang   CDATA   #IMPLIED
                   origin NMTOKEN #IMPLIED >
<!ELEMENT CO ( #PCDATA ) >

<!-- Author note -->
<!ELEMENT ANOTE ( #PCDATA ) >

<!-- BIBTYPE describes what type of publication
     this entry corresponds to. This is
     currently limited to the following tokens
     (taken straight from the BibTeX
     classification):
         article
         book
         booklet
         inbook
         incollection
         inproceedings
         manual
         masterthesis
         misc
         phdthesis
         proceedings
         techreport
         unpublished
-->
<!ELEMENT BIBTYPE ( #PCDATA ) >

<!-- List of all known identifiers for this
     publication -->
<!ELEMENT IDENTIFIERS ( ID+ ) >

<!-- Contents of an ID element is the identifier
     used by a particular publisher or institution.
     Examples:
     <ID origin="UCP" system="PUBID">38426</ID>
     <ID origin="STI" system="ACCNO">A90-12345</ID>
-->
<!ELEMENT ID ( #PCDATA ) >
<!ATTLIST ID origin NMTOKEN #IMPLIED
             type   NMTOKEN #REQUIRED >

<!-- the collective list of institutions that have
     given us a record about this entry. -->
<!ELEMENT ORIGINS ( OR+ ) >
<!ELEMENT OR ( #PCDATA ) >

<!-- The list of objects associated with the
     publication -->
<!ELEMENT OBJECTS ( OB+ ) >
<!ELEMENT OB ( #PCDATA ) >

<!-- Keywords assigned to the publication -->
<!ELEMENT KEYWORDS ( KW+ ) >
<!ATTLIST KEYWORDS lang   CDATA   #IMPLIED
                   origin NMTOKEN #IMPLIED
                   system NMTOKEN #REQUIRED >
<!ELEMENT KW ( #PCDATA ) >

<!-- An abstract of the publication. This is
     typically provided to us by the publisher,
     but may in some cases come from other
     sources (e.g. STI, which keyed abstracts
     in most cases). Therefore we allow several
     ABSTRACT elements within each record, each
     with a separate origin or language.
     The attribute type is used to keep track
     of how the abstract data was generated.
     For instance, abstract text generated by
     our OCR software will have:
         origin="ADS" type="OCR" lang="en"
-->
<!ELEMENT ABSTRACT ( P+ ) >
<!ATTLIST ABSTRACT origin NMTOKEN #IMPLIED
                   type   NMTOKEN #IMPLIED
                   lang   CDATA   #IMPLIED >

<!-- Abstracts are composed of separate
     paragraphs which have mixed contents as
     listed below. All the subelements listed
     below have the familiar HTML meaning and
     are used to render the abstract text in a
     decent way -->
<!ELEMENT P ( #PCDATA | A | BR | PRE | SUP | SUB )* >

<!-- Line breaks (BR) and preformatted text (PRE)
     make it possible to display tables and other
     preformatted text. -->
<!ELEMENT BR EMPTY >
<!ELEMENT PRE ( #PCDATA | A | BR | SUP | SUB )* >

<!-- A is the familiar anchor element. -->
<!ELEMENT A ( #PCDATA | BR | SUP | SUB )* >
<!ATTLIST A HREF CDATA #REQUIRED >

<!-- SUP and SUB are superscripts and subscripts.
     In our content model, they are allowed to
     contain additional SUP and SUB elements,
     although we may decide to restrict them to
     PCDATA at some point -->
<!ELEMENT SUP ( #PCDATA | A | BR | SUP | SUB )* >
<!ELEMENT SUB ( #PCDATA | A | BR | SUP | SUB )* >

Appendix B:

A sample text file from the ADS Abstract Service showing 
XML markup for the full bibliographic entry, including 
records from STI, MNRAS, and SIMBAD. Items in bold 
are those selected to create the canonical text file shown 
in Appendix C. 
<?xml version="1.0"?>
<!DOCTYPE ADS_BIBALL SYSTEM "ads.dtd">
<ADS_BIBALL>
<BIBRECORD origin="STI">
  <TITLE>Spectroscopic confirmation of redshifts
  predicted by gravitational lensing</TITLE>
  <AUTHORS>
    <AU AF="1">
      <FNAME>Tim</FNAME>
      <LNAME>Ebbels</LNAME>
    </AU>
    <AU AF="1">
      <FNAME>Richard</FNAME>
      <LNAME>Ellis</LNAME>
    </AU>
    <AU AF="2">
      <FNAME>Jean-Paul</FNAME>
      <LNAME>Kneib</LNAME>
    </AU>
    <AU AF="2">
      <FNAME>Jean-Francois</FNAME>
      <LNAME>LeBorgne</LNAME>
    </AU>
    <AU AF="2">
      <FNAME>Roser</FNAME>
      <LNAME>Pello</LNAME>
    </AU>
    <AU AF="3">
      <FNAME>Ian</FNAME>
      <LNAME>Smail</LNAME>
    </AU>
    <AU AF="4">
      <FNAME>Blai</FNAME>
      <LNAME>Sanahuja</LNAME>
    </AU>
  </AUTHORS>
  <AFFILIATIONS>
    <AF ident="AF_1">Cambridge, Univ.</AF>
    <AF ident="AF_2">Observatoire Midi-Pyrenees</AF>
    <AF ident="AF_3">Durham, Univ.</AF>
    <AF ident="AF_4">Barcelona, Univ.</AF>
  </AFFILIATIONS>
  <MSTRING>Royal Astronomical Society, Monthly
  Notices, vol. 295, p. 75</MSTRING>
  <MONOGRAPH>
    <MTITLE>Royal Astronomical Society, Monthly
    Notices</MTITLE>
    <VOLUME>295</VOLUME>
  </MONOGRAPH>
  <PAGE>75</PAGE>
  <PUBDATE>
    <YEAR>1998</YEAR>
    <MONTH>03</MONTH>
  </PUBDATE>
  <CATEGORIES>
    <CA>Astrophysics</CA>
  </CATEGORIES>
  <BIBCODE>1998MNRAS.295...75E</BIBCODE>
  <BIBTYPE>article</BIBTYPE>
  <IDENTIFIERS>
    <ID type="ACCNO">A98-51106</ID>
  </IDENTIFIERS>
  <KEYWORDS system="STI">
    <KW>GRAVITATIONAL LENSES</KW>
    <KW>RED SHIFT</KW>
    <KW>HUBBLE SPACE TELESCOPE</KW>
    <KW>GALACTIC CLUSTERS</KW>
    <KW>ASTRONOMICAL SPECTROSCOPY</KW>
    <KW>MASS DISTRIBUTION</KW>
    <KW>SPECTROGRAPHS</KW>
    <KW>PREDICTION ANALYSIS TECHNIQUES</KW>
    <KW>ASTRONOMICAL PHOTOMETRY</KW>
  </KEYWORDS>
  <ABSTRACT>
We present deep spectroscopic measurements of 18 distant
field galaxies identified as gravitationally lensed arcs in a
Hubble Space Telescope image of the cluster Abell 2218.
Redshifts of these objects were predicted by Kneib et al.
using a lensing analysis constrained by the properties
of two bright arcs of known redshift and other multiply
imaged sources. The new spectroscopic identifications
were obtained using long exposures with the LDSS-2
spectrograph on the William Herschel Telescope, and
demonstrate the capability of that instrument to reach
new limits, R = 24; the lensing magnification implies true
source magnitudes as faint as R = 25. Statistically, our
measured redshifts are in excellent agreement with those
predicted from Kneib et al.'s lensing analysis, and this
gives considerable support to the redshift distribution
derived by the lensing inversion method for the more
numerous and fainter arclets extending to R = 25.5. We
explore the remaining uncertainties arising from both the
mass distribution in the central regions of Abell 2218
and the inversion method itself, and conclude that the
mean redshift of the faint field population at R = 25.5
(B = 26-27) is low, (z = 0.8-1). We discuss this result
in the context of redshift distributions estimated from
multicolor photometry.
  </ABSTRACT>
</BIBRECORD>

<BIBRECORD origin="MNRAS">
  <TITLE>Spectroscopic confirmation of redshifts
  predicted by gravitational lensing</TITLE>
  <AUTHORS>
    <AU AF="1">
      <FNAME>Tim</FNAME>
      <LNAME>Ebbels</LNAME>